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Abstract 


A dedicated algorithm for sparse spectral representation of music sound is presented. The goal 
is to enable the representation of a piece of music signal as a linear superposition of as few 
spectral components as possible, without affecting the quality of the reproduction. A repre¬ 
sentation of this nature is said to be sparse. In the present context sparsity is accomplished 
by greedy selection of the spectral components, from an overcomplete set called a dictionary. 
The proposed algorithm is tailored to be applied with trigonometric dictionaries. Its distinctive 
feature being that it avoids the need for the actual construction of the whole dictionary, by 
implementing the required operations via the Fast Fourier Transform. The achieved sparsity 
is theoretically equivalent to that rendered by the Orthogonal Matching Pursuit method. The 
contribution of the proposed dedicated implementation is to extend the applicability of the 
standard Orthogonal Matching Pursuit algorithm, by reducing its storage and computational 
demands. The suitability of the approach for producing sparse spectral representation is il¬ 
lustrated by comparison with the traditional method, in the line of the Short Time Fourier 
Transform, involving only the corresponding orthonormal trigonometric basis. 

Keywords: Sparse Representation of Music Signals; Self Projected Matching Pursuit. 
PACS: 43.75.Zz, 43.60 
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I Introduction 


Spectral representation is a classical approach which plays a central role in the analysis and 
modelling of both, music sounds (Serra and Smith, 1990; Fletcher and Rossing, 1998; Davy 
and Godsill, 2003) and acoustic properties of music instruments (Wolfe et ai, 2001). 

Available techniques aiding the spectral analysis of music range from the Fast Fourier 
Transform (FFT) and Short Time Fourier Transform (STFT) to several classes of joint Time 
Frequency/Scale distributions (Aim and Walker, 2002; Smith 2011) and atomic representations 
(Mallat and Zhang, 1993; Gribonval and Bacry, 2003). 

In this Gommunication we focus on the representation of a digital piece of music, as the 
superposition of vectors arising by the discretization of trigonometric functions. The aim is to 
represent segments of a sound signal, as a linear combination of as few spectral components as 
possible without affecting the quality of the sound reproduction. We referrer to the sought rep¬ 
resentation as piecewise sparse spectral representation of music sound. Additionally to the typ¬ 
ical advantages of sparse signal representation, the emerging theory of compressive/compressed 
sensing (Baraniuk, 2007, 2011; Donoho, 2006; Gandes, et al. 2006; Gandes and Wakin, 2008) 
has introduced a renewed strong reason to pursue sparse representation of music. This the¬ 
ory associates sparsity to a new framework for digitalization, beyond the Nyquist/Shannon 
sampling theorem. Within the compressive sensing framework, the number of measurements 
needed for accurate representation of a signal informational content decreases, if the sparsity 
of the representation improves. 

For the class of compressible signals the sparse approximation can be accomplished by rep¬ 
resentation in an orthonormal basis, simply by disregarding the least signihcant terms in the 
decomposition. Melodic music signals are known to be compressible in terms of trigonometric 
orthonormal basis. However, a much higher level of sparsity may be achieved by releasing 
the orthogonality property of the spectral components (Mallat and Zhang, 1993; Gribonval 
and Bacry, 2003; Rebollo-Neira, 2016a). The price to be paid for that is the increment in 
the complexity of the numerical algorithms producing the corresponding sparser approxima¬ 
tion. Practical algorithms for this purpose are known as greedy pursuit strategies (Friedman 
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and Stuetzle, 1981; Jones, 1987; Mallat and Zhang, 1993). In Gribonval and Bacry (2003) a 
dedicated Matching Pnrsnit method for effective implementation of the spectral model is de¬ 
veloped by means of well localized freqnency components of variable length. In Rebollo-Neira 
(2016a) an alternative approach is considered. It involves the approximation of a signal by 
partitioning, according to the following steps: i)The signal is divided into small nnits (blocks) 
ii)Each block is approximated by nonorthogonal spectral components, independently of each 
other bnt somewhat ‘linked’ by a global constraint on sparsity or qnality. The global constraint 
is fnlhlled by establishing a hierarchy for the order in which each element in the partition is to 
be approximated. Thns, the method reqnires significant storage. Even if the global constraint 
is disregarded, and each nnit approximated totally independent of the others, the algorithms 
in Rebollo-Neira (2016a) are effective for partition nnits of moderate length. For nnits of larger 
size there is a need of mathematics algorithms specialized to that sitnation. This is the goal of 
the present work. We propose a dedicated algorithm for nonorthogonal sparse spectral model¬ 
ing which, as a conseqnence of allowing for relatively large elements in a partition, somewhat 
rednces the need for a global constraint on sparsity. This makes it possible for the approxi¬ 
mation of each nnit np to the same qnality and completely independent of the others. The 
approach is, thereby, snitable for straightforward parallelization in mnltiprocessors. As far as 
sparsity is concerned, the results are theoretical equivalents to those produced by the effective 
Orthogonal Matching Pursuit method (Pati et ai, 1993). The particularity of the proposed 
implementation, dedicated to trigonometric dictionaries, is that it avoids the need for storing 
the whole dictionary and reduces the complexity of calculations via the Fast Fourier Transform. 
The relevance of sparse spectral representation with trigonometric dictionaries, in the context 
of music compression with high quality recovery, is illustrated in Rebollo-Neira (2016b). 

The paper is organized as follows: Sec. [^discusses the spectral model outside the traditional 
orthogonal framework. The mathematical methods for operating within the nonorthogonal 
setting are also discussed in this section, motivating the proposed dedicated approach. The 
approach is first explained and then summarized in the form of pseudocodes (Algorithms 1- 


6) given in Appendix A. The examples of Sec. Ill illustrate the benefit of a nonorthogonal 
framework, against the orthogonal one, in relation to the very signihcant gain in the sparsity of 


4 



the spectral representation of music signals for high quality recovery. The results presented in 
this section demonstrate the relevance of the proposed greedy strategy dedicated to be applied 


with trigonometric dictionaries. The conclusions are summarized in Sec. IV 


II Sparse Spectral Representation 


Let’s assume that a sound signal is given by N sample values, /(j), j 
modeled by the following transformation: 

. M 


'' n=l 


27 r(j-l)(n-l) 


3 = 1, 


V. 


1,..., iV, which are 


( 1 ) 


For M = N the set of vectors {;^e* j = 1,..., is an orthonormal basis for 

the subspace of iV- dimensional vectors of complex components. Thus the coefficients in ([^ are 
easily obtained as 

M 

c(n) = ^'pj{j)e-' , n = l,...,M = N. (2) 

Equations Q and ([^ can be evaluated in a fast manner via the FFT. 

Suppose now that M > N. In that case the set m , j = is 

no longer an orthonormal basis but a tight frame (Young, 1980, Daubechies, 1992). From a 
computational viewpoint the difference with the case M = iV is much less pronounced than 
the theoretical difference. Certainly, when dealing with a tight frame the coefficients in ([^ can 
still be calculated via FFT, by zero padding. The differences though with the orthogonal case 
are major. 


i) When M > N the coefficients in the superposition ([^ are not unique. The addition of 
a linear combination with coefficients taken as the components of any vector in the null 
space of the transformation would not affect the reconstruction. 

ii) The tight frame coefficients calculated via FFT, by zero padding, produce the unique 

M 

coefficients minimizing the square norm |c(n)|^. Such a solution is not sparse. 

n=l 

iii) For the case M = N the approximation obtained through ([^, by disregarding coefficients 
of small magnitude, is optimal in the sense of minimizing the norm of the residual error. 
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This is not true when M > A^, in which case the nonzero coefficients need to be re¬ 
calculated to attain the equivalent optimality (Rebollo-Neira, 2007). 


In order to construct an optimal approximation of the data by a representation of the form 
Q, with M > N but containing at most k non zero coefficients, those coefficients have to 
be appropriately calculated. Let’s suppose that we want to involve only the elements n = 
1,..., fc where each is a different member from the set - ,M}. Then the approximation 

model takes the form 


!“(]) 


n=l 


27r(j-l)(£n-l) 

M 


j = 1, 




(3) 


The superscript k in the coefficients c'^{in), n = 1,... ,k indicates that they have to be recal¬ 
culated if some terms are added to (or eliminated from) the model (|^. We address the matter 
of choosing the k elements in ([^ by a dedicated Self Projected Matching Pursuit (SPMP) 
approach (Rebollo-Neira and Bowley, 2013). 


A Self Projected Matching Pursuit 

Before reviewing the general SPMP technique let’s define some basic notation: M, C and 
N represent the sets of real, complex and natural numbers, respectively. Boldface letters are 
used to indicate Euclidean vectors and standard mathematical fonts for their components, e.g., 
d G C'^ is a vector of iV-components d{j) G ,j = 1,... ,N. The operation (•, ■) indicates 
the Euclidean inner product and || ■ || the induced norm, i.e. ||dp = (d,d), with the usual 
inner product definition: For d G and f G 

N 

(f,d) = 

i=i 

where f*{j) stands for the complex conjugate of /(j). 

Let’s consider now a set V oi M normalized to unity vectors V = {d„ G ; ||d„|| = l}n=i 
spanning C^. For M > N the over-complete set T> is called a dictionary and the elements 
are called atoms. Given a signal, as a vector f G C'^, the fc-term atomic decomposition for its 
approximation takes the form 

k 

e = Y^cHln)A,„. ( 4 ) 

n=l 
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The problem of how to select from V the k elements n = 1..., fc, such that ||f^ ~ f || is 
minimal, is an NP-hard problem (Natarajan, 1995). The equivalent problem, that of finding 
the sparsest representation for a given upper bound error, is also NP hard. Hence, in practical 
applications one looks for ‘tractable sparse’ solutions. This is a representation involving a 
number of fc-terms, with k acceptable small in relation to N. Effective techniques available for 
the purpose are in the line of Matching Pursuit Strategies. The seminal approach. Matching 
Pursuit (MP), was introduced with this name in the context of signal processing by Mallat and 
Zhang (1993). Nevertheless, it had appeared previously as a regression technique in statistics 
(Friedman and Stuetzle, 1981) where the convergence property was established (Jones, 1987). 
The MP implementation is very simple. It evolves by successive approximations as follows. 

Let R*’ be the fc-th order residue dehned as R*’ = f — f^, and the index for which 
the corresponding dictionary atom d^^^ yields a maximal value of |(d„,R^)|, n = 1,...M. 
Starting with an initial approximation = 0 and R° = f — f° the algorithm iterates by 
sub-decomposing the fc-th order residue into 

R‘={d4„,R‘)d,„+R‘+‘, (5) 

which defines the residue at order k + 1. Because the atoms are normalized to unity given 
in (|^ is orthogonal to . Hence it is true that 

\\RT = I(d 4 ^,, R'=)|2 -f n = 1,..., M, (6) 

from where one gathers that the dictionary atom yielding a maximal value of |(R^,d„)| 
minimizes ||R^'''^||^. Moreover, it follows from ([^ that at iteration k the MP algorithm results 
in an intermediate representation of the form: 

f = f^ + R^+\ (7) 

with 

k 

= (*) 

n=l 

In the limit fc —)■ cxd the sequence f*’ converges to f, or to the orthogonal projection of f 

onto Ym = spanld^}))!;^ if f were not in Ym (Jones, 1987; Mallat and Zhang, 1993; Partington 
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1997). Nevertheless, if the algorithm is stopped at the /cth-iteration, recovers an approxima¬ 
tion of f with an error eqnal to the norm of the residnal which, if the selected atoms are 
not orthogonal, will not be orthogonal to the snbspace they span. An additional drawback of 
the MP approach is that the selected atoms may not be linearly independent. As illnstrated in 
Rebollo-Neira and Bowley (2013), this drawback may signihcantly compromise sparsity in some 
cases. A rehnement to MP, which does yield an orthogonal projection approximation at each 
step, has been termed Orthogonal Matching Pursnit (OMP) (Pati et ai, 1993). In addition to 
selecting only linearly independent atoms, the OMP approach improves npon MP nnmerical 
convergence rate and therefore amonnts to be, usnally, a better approximation of a signal after 
a hnite nnmber of iterations. OMP provides a decomposition of the signal of the form; 

k 

f = 5]c‘(4)d,„+R‘, (9) 

n=\ 

where the coefficients c^{in) are compnted to guarantee that 

k 

= PyJ, with = span{d£„}^^i. (10) 

n=l 

The coefficients giving rise to the orthogonal projection A^j.f can be calculated as c^{in) = 
(b^, f), where the vectors b^, n = 1,... ,k are biorthogonal to the selected atoms , n = 
1,..., /c and span the identical subspace, i.e., = spanjb^}^^]^ = span{d£^}^^^. These coef- 

hcients yield the unique element G minimizing ||f^ ~ f||- A further optimization of MP, 
called Optimized Orthogonal Matching Pursuit (OOMP) improves on OMP by also selecting 
the atoms yielding stepwise minimization of ||f^ — f|| (Rebollo-Neira and Lowe, 2002). Both 
OMP and OOMP are very effective approaches for processing signals up to some dimension¬ 
ality. They become inapplicable, due to its storage requirements, when the signal dimension 
exceeds some value. Since large signals are approximated by partitioning, up to some size of the 
partition unit both OMP and OOMP are suitable tools. For considering units of size exceeding 
the limit of OMP applicability, the alternative implementation, SPMP, which yields equivalent 
results (Rebollo-Neira and Bowley, 2013) is to be applied. The latter is based on the fact that, 
as already mentioned, the seminal MP approach converges asymptotically to the orthogonal 
projection onto the span of the selected atoms. Hence MP itself can be used to produce an 



orthogonal projection of the data, at each iteration, by self-projections. The orthogonal pro¬ 
jection is realized by snbtracting from the residue its approximation constructed through the 
MP approach, but only using the already selected atoms as dictionary. This avoids the need 
of computing and storing the above introduced vectors b^, n = 1,..., fc, for calculating the 


coefficients in (10). 


The SPMP method progresses as follows (Rebollo-Neira and Bowler, 2013). Given a dic¬ 
tionary V = {d„ G ; ||d„|| = l}n=i and a signal f G C^, set Sq = {0}, f° = 0, and R° = f. 
Starting with A; = 1, at each iteration k implement the steps below. 

i) Apply the MP criterion described above for selecting one atom from V, i.e., select ik such 


that 


4 = argmax|(dn,R*^ ^)| (11) 

71=1,...,M 

and assign Sk = Sk-i U d^. Update the approximation of f as -|- (d^, R^^^jd^ 

and evaluate the new residue R^ = f — f^. 


ii) Approximate R^ using only the selected set Sk as the dictionary, which guarantees the 

asymptotic convergence to the approximation TVj,R^ of R^, where Yk = span{S'fc}, and 
a residue R-*- = R^ — having no component in V^. 

iii) Set ^ -1- iAj,R^, R^ R-*-, k k + 1, and repeat steps i) - iii) until, for a required 

p, the condition ||R^|| < p is reached. 

B Dedicated SPMP algorithm for sparse spectral decomposition 

Even if SPMP reduces the storage requirements for calculating and adapting the coefficients 
of an atomic decomposition, storage and complexity remains an issue for processing a signal 
by partitioning in units of considerable size. Notice that the SPMP method involves repetitive 
calculations of inner products. The advantage of using a trigonometric dictionary, in addi¬ 
tion to rendering highly sparse representations in relation to a trigonometric basis, is that a 
trigonometric dictionary allows the design of a dedicate SPMP implementation, which avoids 
the construction and storage of the actual dictionary by calculating inner products via FFT. 

From now on we shall make use of the knowledge that a piece of music is given by real num- 
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bers, i.e. f G The dictionaries we consider for producing sparse spectral decompositions of 
the data are: the Redundant Discrete Fourier (RDF) dictionary, D-f, the Redundant Discrete 
Cosine (RDC) dictionary, and the Redundant Discrete Sine (RDS) dictionary, dehned 
below. 


r 1 .a^O-nG-i) . AnM 

'DC _ r 1 j _ 1 ^J^M 


— r 1 a — 


sin 

w^in) 




where w^{n) and w^{n), n = 1,..., M are normalization factors as given by 

a/TV if n = 1, 

_;_/7r(n—1) _^_/27r(n— 

if n 7^ 1. 


in^iTZ) ^ ^ • /27r(n—l)N’\ 

^ ^ ^ 'n I sm( \j O sm( ^ ^ ^ ) 


- + 
2 ^ 


2(l-cos(?^(^)) 


f ViV if n = 1, 

(n) = < /"T 

' ^ 1 I N _ SIIH ) sin( „ / 1 

2 2(l-cos(^)) ^ 


For M = N each of the above dictionaries is an orthonormal basis, the Orthogonal Discrete 
Fourier (ODF), Cosine (ODC), and Sine (ODS) basis, henceforth to be denoted as and 

B^ respectively. The joint mixed dictionary UV^, with and having the same 

number of elements, is an orthonormal basis for M = y, the Orthogonal Discrete Cosine-Sine 
(ODCS) basis to be indicated as i3“. If M > y, becomes a Redundant Discrete Cosine 
and Sine (RDCS) dictionary. 

For facilitating the discussion of fast calculation of inner products with trigonometric atoms, 
given a vector y G C^, let’s dehne 

N 


n = 1,..., M. 


( 12 ) 


i=i 


When M = (12) is the Discrete Fourier Transform of vector y G C^, which can be evaluated 


using FFT. If M > iV we can still calculate (12) via FFT by padding with (M — N) zeros 


the vector y. Equation (12) can also be used to calculate inner products with the atoms in 


dictionaries and T>^. Indeed, 

N 


COS 

i=i 


7r(2j - l)(n - 1) 
2M 


y{j) =Re(^e * n, 2M) j , n = 1,..., M. 


(13) 
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and 


^ 7r(2j_l^n—1) ^ j'g *"2VV(y,n,2M)j , n = 2,... ,M +1, (14) 

3 = 1 

where Re(z) indicates the real part of z, Ini(z) its imaginary part, and the notation J^(y, n, 2M) 
implies that the vector y is padded with (2M — N) zeros. 

We associate the dictionaries and to the cases I, II, III, and IV, of the 

dedicated SPMP Algorithm (SPMPTrgFFT), which is developed in Algorithm 6 of Appendix 
A, by reconrse to the procednres given in Algorithms 1-5. 

C Procedures for an implementation of the SPMP method dedi¬ 
cated to trigonometric dictionaries 

Let ns recall once again that the aim of the present work is to be able to apply the SPMP 
algorithm, witch is theoretically eqnivalent to the OMP method, bnt withont evalnating and 
storing the dictionaries or Instead, only the selected atoms are evalnated 

(Algorithm]^ and the inner prodncts are performed via FFT (Algorithm [^ . Apart from that, 
the dedicated implementation follows the steps of the general SPMP method. Some particnlar 
featnres are worth remarking. 

• Notice that for Case I, as a conseqnence of the data being real nnmbers, it holds that 

J^{y,n,M) = — n + 2,M). Hence the atoms can be taken always in pairs, ik 

and (M — 1^ + 2). 

• The procednre for self projection of MP (Algorithm]^, is a recnrsive implementation 

of the selection procednre, bnt the selection is carried ont only over the, say k, already 

selected atoms (Algorithm |^. Then the calcnlation of the relevant inner prodncts is 

M 

worth being carried ont via FFT only for valnes of k larger than — log 2 M. 

• In order to provide all the implementation details of the proposed method in a clear and 
testable manner, we have made pnblicly available a MATLAB version of the psendocodes 
(Algorithms 1-6), as well as the script and the signals which will allow the interested re¬ 
searcher to reprodnce the nnmerical resnlts in this paper The MATLAB rontines shonld 
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be taken only as ‘demonstration material’. They are not intended to be an optimized im¬ 
plementation of the algorithms. Such optimization should depend on the programming 
language used for practical applications. 

Ill Numerical Examples 

We apply now the SPMPTrgFFT algorithm to produce a sparse spectral representation of the 

sound clips listed in Table 1 and Table 2. The approximation is carried out by dividing the 

signals into disjoint pieces fg G q = 1,... ,Q of uniform length W, be., f = Jq=ifq, where 

J indicates a concatenation operation and N = QNj,. 

The purpose of the numerical example is to illustrate the relevance of the method to produce 

sparse spectral representation of music, in comparison to the classical orthogonal representation 

in the line of STFT. Each segment q is approximated up to the same quality. The sparsity 

N 

is measured by the Sparsity Ratio (SR) dehned as SR = —, where K is the total number 
of coefficients in the signal representation, i.e, denoting by kq the number of coefficients for 
approximating the g-th segment K = ^q- 

As a measure of approximation quality we use the standard Signal to Noise Ratio (SNR), 

<?=! 

q=l 

All the clips of Table 1 are approximated up to SNR=35dB. The approximation has been 
carried out using all the dictionaries introduced in Sec. with redundancy four, and all the 
concomitant orthogonal basis. Due to space limitation only the best results produced by a 
dictionary, and by a basis, are reported. The best dictionary results are rendered by the 
mixed dictionary Nevertheless, in the case of a basis the best results are achieved by the 
cosine basis The approximation of all the clips in Table 1 was carried out for partitions 
corresponding to W equal to 512, 1024, 2048, 4096, 8192, and 16384 samples. For space 
limitation only the sparsity results corresponding to all those values of W are shown for the 
hrst two clips of the table. Fig. 1 gives the classic spectrogram for the Flute Exercise and 
Classical Guitar. Fig. 2 shows the values of the SR for those clips, as a function of the partition 
unit size W- As seen in the hgures, for all the values of W, the gain in sparsity produced 


SNR = 10 log 


10 




= 10 log 


10 


E. 


N, 
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Figure 1: (Color online only) Spectrograms of the Flute Exercise clip (left) N = 65536 samples at 
22050 Hz, and that of the Classic Guitar, N = 262144 samples at 44100Hz. Each spectrogram was 
produced using a Hamming window of length 4096 samples and 50% overlap. 
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Figure 2: (Color online only) SR, for the Flute Exercise clips (left) and Classical Guitar (right) 
corresponding to values of Nf, equal to 512, 1024, 2048, 4096, 8192, and 16384 samples. The squares 
are the SR values obtained with the orthogonal basis The circles are the results produced by the 
mixed dictionary redundancy four, by means of the proposed algorithm. 


by the dictionary (represented by the circles in Fig. 2) in relation to the best result for the 
basis (squares in those hgures) is very signihcant. Table 1 shows the values of SR for the clips 
listed in the hrst column, using the basis and the dictionary 1)“ with the methods MP and 
SPMP. The value of Nj, is set as that producing the best SR for the orthogonal basis which, 
as illustrated in the left graph of Fig. 2, is not always the optimal value for the dictionary 
approach. The implementation of the MP algorithm via FFT, which we call MPTrgFFT, is 
ready realized simply by deactivating the self projection step. The clips in Table 1 are played 
with a variety of instruments. The sampling frequencies are: 22050 Hz for the Flute Exercise 
and Himno del Riego, 48000 Hz for the Polyphon, and 44100 Hz for all the other clips. The 
SR varies signihcantly, from the sparsest clip (Oboe in C) to the least sparse one (Polyphon). 
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Clip 

Nb 

SR (H") 

SR (MP) 

SR (SPMP) 

Flnte Exercise 

8192 

6.5 

11.8 

13.9 

Classic Gnitar 

16384 

18.7 

26.6 

31.4 

Rock Piano 

2048 

6.9 

10.2 

12.0 

Pop Piano 

8192 

11.7 

15.1 

18.0 

Rock Ballad 

8192 

6.8 

8.9 

10.5 

Bach Piano 

4096 

11.8 

14.8 

17.4 

Trnmpet Solo 

8192 

8.3 

11.9 

14.7 

Himno del Riego 

4096 

4.9 

7.6 

8.9 

Oboe in C 

16384 

13.7 

44.1 

53.5 

Classical Romance 

8192 

7.2 

11.2 

13.4 

Jazz Organ 

8192 

18.7 

22.5 

28.1 

Marimba Jazz 

1024 

11.8 

15.3 

18.6 

Begana 

2048 

8.5 

10.0 

12.0 

Vibraphone 

2048 

12.7 

20.1 

23.8 

Polyphon 

4096 

3.7 

6.1 

7.1 


Table 1: SR obtained with the basis and the dictionary through the MP and SPMP methods, 
for the clips listed in the first column. The value of the partition unite iV^ is the one corresponding to 
the best SR result with the basis B^^ when Nf, takes the values 512, 1024, 2048, 4096, 8192, and 16384. 


Nevertheless, the gain in sparsity obtained with the trigonometric dictionaries, in relation to 
the best orthogonal basis, is in most cases very signihcant. Notice that drnms are not inclnded 
in the list. The reason being that drnm loops are best approximated when the partition size 
is considerably smaller than for the instrnments in Table 1. Hence, the proposed algorithm is 
not of particnlar help in that case. On the contrary, as discnssed in Sec. [T| a method linking 
the approximation of the elements in the partition throngh a global constraint on sparsity, or 
qnality, is mnch better snited to that sitnation (Rebollo-Neira 2016a). The same holds trne 
for speech signals. Additionally, we nnderstand that drnm loops do not fall within the class 
of mnsic that can be sparsely represented only with trigonometric atoms of the type we are 
considering here. 

In order to compare the improvement in SR prodnced by the SPMP method (SRspmp) over 
the MP one (SRmp) we dehned the relative gain in sparsity as follows: 

G = 55s£MLZ^ioo% (15) 

SRmp 

For the resnlts of Table 1 the mean valne gain is (5 = 19.4% with standard deviation of 2.4%. 
Fig. 3 gives a visnal representation of the implication of the SR valne. The left graphs is a 
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Figure 3: (Color online only) The left graph is the classic spectrogram of the Polyphon clip obtained 
with a Hamming window of length 4096 samples and 50% overlap. The right graph is the sparser 
version of the spectral decomposition, realized by the trigonometric dictionary and the SPMPTrgFFT 
algorithm, on a partition of disjoint units of size N^, = 4096. 

classic spectrogram for the Polyphon clip, which has been re-scaled to have the maximum value 
equal to one. The right graph is the sparse spectral representation constructed with the outputs 
of the SPMPTrgFFT algorithm (also re-scaled to have maximum value equal to one). Because 
the spectrograms are given in dB, and the sparse one has zero entries, the value 10“^^ was 
added to all the spectral power outputs to match scales. 

In order to give a description of local sparsity we consider the local sparsity ratio svg = 
g = l,...,Q, where kq is the number of coefficients in the decomposition of the g-block and 
Nfy the size of the block. For illustration convenience the graphs in Figs. 4 depict the inverse of 
this local measure. The points in those hgures represent the values l/sVq, q = 1,... ,Q. Each 
of these values is located in the horizontal axis at the center of the corresponding block. For 
each signal the size of the block is taken to be the value Nb yielding the largest SR with the 
dictionary approach, for that particular signal. 

The lighter lines in all the graphs of Fig. 4 represent the Flute, Marimba Classic Guitar and 
Pop Piano clips. It is interesting to see that each if the darker lines joining the inverse local 
sparsity points follows, somewhat, the shape of signal’s envelop. This is particularly noticeable 
when a transient occurs. 

As opposed to the method of Serra and Smith (1990), which would model a possible com¬ 
ponent of a sound clip by tracking the evolution of some frequencies along time, but in general 
would produce a signihcant residue, the goal of the proposed sparse spectral representation is to 
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Figure 4: (Color online only) The points joined by the darker line in all the graphs are the values of 
the inverse local sparsity ratio l/sr^, q = 1,... ,Q. The top graphs correspond to the Flute (left) and 
Classic Guitar clips. The bottom graphs correspond to the Marimba (left) and Pop Piano clips. The 
lighter lines represent the signals. 


achieve high quality reconstruction. As indicated by the points in the graphs of Fig. 4, for some 
signals this is attained by a decomposition of low local sparsity in particular blocks. Notice, 
however, that a signal exhibiting such picks of inverse local sparsity may produce, on the whole, 
a SR which is higher than the SR of a signal endowed with more uniform local sparsity, e.g. 
Flute vs Marimba and Pop Piano. The clips of Table 1 are all played with single instruments. 
The rather high value of SNR (35dB) is set to avoid noticeable loss or artifacts in the signal 
reconstruction, which might be easy to detect due to the nature of the sound. Nevertheless, 
for the clips of Table 2, which are played by multiple instruments, for SNR=25dB (and even 
lower) we do not perceive loss or artifacts. Hence, the sparsity results of Table 2 correspond 
to SNR=25dB. Overestimating the required SNR for high quality recovery would produce a 
significant reduction of the SR values. 


Clip 

SR (R") 

SR (MP) 

SR (SPMP) 

Classic Music (sextet) 

12.2 

16.2 

18.4 

Piazzola Tango (quartet) 

10.7 

13.8 

15.7 

Opera (female voice) 

5.6 

7.5 

8.3 

Opera (male voice) 

9.2 

12.0 

13.5 

Bach Fugue (orchestral version) 

8.2 

12.4 

14.1 

Simple Orchestra 

13.1 

17.6 

19.8 


Table 2: SR obtained with the basis and the dictionary through the MP and SPMP methods, 
for the clips listed in the first column. The partition unite size is in all the cases Ni, = 4096 and the 
sampling frequency 44100 Hz. 
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For the results of Table 2 the mean value gain in SR (c.f. (15)) is G = 12.8% with standard 
deviation of 1.2%. 


Remarks on computational complexity: The increment in the computational complex¬ 
ity of SPMPTrgFFT with respect to MPTrgFFT is a factor which accounts for the iterations 
realizing the self-projections. In order to estimate the complexity we indicate by k the double 
average of the number of iterations in the projection step. More specihcally, indicating by Kk 
the number of iterations in the fc-term approximation of a hxed segment q, Kq = -^ Ylk=i 


1 P _ 

and ^ ^ i^q- 

9=1 _ _ 

The value of k gives an estimation of the SPMPTrgFFT complexity: 0{kKM \og 2 M). Since 
for a dictionary of redundancy r the number of elements is M = in order to make clearer the 
influence of the segment’s length in the complexity, this can be expressed as Oij^KrNi, log 2 rNi,). 
The computational complexity of plain MPTrgFFT is given by the complexity of calculating 
inner products via FFT, i.e. O(iPriVf,log 2 rA";,). Hence k gives a measure of the increment of 
complexity introduced by the projections to achieve the desired optimality in the coefficients 
of the approximation. Fig. 5 shows the values of F as a function of the segment’s length N^,. 
The triangles correspond to the Flute Exercise clip the starts to the Classic Guitar clip. Notice 
that for the Flute Exercise the value of k augments significantly for the two larger values of 
while remains practically constant for the Classic Guitar. This feature is in line with the 
fact that, as seen in Fig 2, the SR for those values of Ni, is practically constant for the Classic 
Guitar, but decreases for the Flute Exercise. 


IV Conclusions 


A dedicated method for sparse spectral representation of music sound has been presented. 
The method was devised for the representation to be realized outside the orthogonal basis 
framework. Instead, the spectral components are selected from an overcomplete trigonometric 
dictionary. The suitability of these dictionaries for sparse representation of melodic music, by 
partitioning, was illustrated on a number of sound clips of different nature. While the quality of 
the reconstruction is an input of the algorithm, the method is conceived to achieve high quality 
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Figure 5: (Color online only) Average number of the iterations, 75, for realizing the projection step 
procedure (Algorithm 5) corresponding to partition units of length Nh equal to 512, 1024, 2048, 4096, 
8192, and 16384 samples. The triangles are the values for the flute clip and the starts for the classic 
guitar. 


recovery. Hence, in order to benefit sparsity results the signal partition is realized without 
overlap. The approach has been shown to be worth applying to improve sparsity within the 
class of signal which are compressible in terms of a trigonometric basis. The achieved sparsity 
is theoretically equivalent to that produced by the OMP approach with the identical dictionary. 
The numerical equivalence of both algorithms was verihed when possible. 

In order to facilitate the application of the approach we have made publicly available the 
MATLAB version of Algorithms 1-6 on a dedicated web page^. It is appropriate to stress, 
though, that the routines are not intended to be an optimized implementation of the method. 
On the contrary, they have been produced with the intention of providing an easy to test form 
of the approach. We hope that the MATLAB version of the algorithms will facilitate their 
implementation in appropriate programming languages for practical applications. 
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Notes 

^http://www.nonlinear-approx.info/examples/node02.html 
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Appendix A 


Algorithm 1 Computation of inner product with a trigonometric dictionary via FFT. IP- 
TrgFFT procedure: [IP] =IPTrgFFT(R, M, Case) 


Input: R G M, number of elements in the dictionary, and Case (I , II, or III). 
{Computation of the inner products IP = (d, R) G C^} 

Case I 


''2M 


to compute IP.} 


IP = FFT(R,M)^, 

Case II, III (ci. Q, ||I||) 

(Computation of auxiliary vector Aux G 
Aux = FFT(R,2M) 

Case II ^ 

IP{n) = Re(e*^^^ Aux{n)), u = 1,..., M 

Case III 

IP{n - 1) = Im(e*^^^ Aux{n)), n = 2,..., M + 1 


Algorithm 2 Generation of an atom, given the index and the dictionary type. Trigonometric 
Atom procedure: [d 4 ]=TrgAt(f'fc, M, A, Case) 

Input: Index ik, number of elements in the dictionary M, atom’s dimension N, Case (I, II, 
III or IV). 

Output: Atom d^. 

(Generation of the atom, d^, according to the Gase} 
if Gase=IV then 
M ^ f 

end if 

Gase I 

J 1 2sr0-l)(4-l) . 

M , J = 1,...,A 

Gase II (and Gase IV if ik < 


^4U) = GTTTFT cos(- — -), J = 1, 


m;44) 


2M 


A 


M 

Gase III (and Gase IV if 4 > -^) 

. 1 • ,vr(2j-l)4, . , 

^4U) = sm(- — -), j = l,...,A 


w^{ik 
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Algorithm 3 Atom Selection via FFT. AtSelFFT procedure: [t'fc,c(£fc)] =AtSelFFT(R, M, 
Case) 

Input: Residual R G M number of elements in the dictionary, and Case (I, II, III, or 
IV) 

Output: Index of the selected atom Ik, and MP coefficient c(f'fc) = (d 4 ,R) calculated via 
FFT. 

{Call IPTrgFFT procedure. Algorithm to calculate inner products} 

Case I 

IP=IPTrgFFT(R,M, Case I), 

Cases II and III 
IP=IPTrgFFT(R,M, Case), 

(Selection of the new atom and evaluation of the MP coefficient} 

= argmax |/P(?7,)| 

c(4) = lP{ik) 

Case IV 

M ^ f 

IP==IPTrgFFT(R, M, Case II) 

IP*=IPTrgFFT(R, M, Case III) 

u = max(|/P'^(4)|, |/P®(P)|), with 4 = argmax |JP'^(n)| and t = argmax |/P®(n)| 

if = |/P®(P)| then 
ik = P + M and c(4) = /P44) 

else 

4 = 4 and c(4) = /P44) 

end if 
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Algorithm 4 Atom Re-Selection via FFT. AtReSelFFT procedure: 
[£,c(£)]=AtReSelFFT(R,M,F, Case) 

Input: Residue R G number of dictionary’s elements, M, set of indices of the selected 
atoms F = {in}t=i (if Case=IV both, F'’, indices for atoms in and F^, indices for atoms 
in V^). 

Output: Re-Selected index i (out of the set F) and corresponding MP coefficient c{i) = 
(d£,R),f' E F, calculated via FFT. 

Case I 

IP=IPTrgFFT(R,M, Case I), 

Cases II and III 
IP=IPTrgFFT(R,M, Case ), 

{Selection of the index £ G F} 

i = argmax |JP(n)| 
ner 

c{i) = IP{i) 

Case IV 
M ^ f 

IP^=IPTrgFFT(R, M, Case II) 

IP*=IPTrgFFT(R, M, Case III) 

z/ = max(|/P^(r)|, |/P*(P)|, with P = argmax |/P^(n)| and t = argmax |IP"(n)| 

ner'^ ner® 

if z/ = |/P®(P)| then 

£ = p + M and c(£) = /P"(P) 

else 

£ = P and c(£) = /P"(P) 

end if 
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Algorithm 5 Orthogonal Projection via FFT. ProjMPTrgFFT procedure: 
[R, c]=ProjMPTrgFFT(R, M, c, P, e, Case) 

Input: Residue R G number of elements in the dictionary, M, vectors c with the 
coefficients in the /c-term approximation, set P of selected indices up to iteration k, tolerance 
for the numerical error of the projection e, and Case (I, II, III, or IV). 

Output: Updated residue, R G orthogonal to span{d„}ngr and updated coefficients c 
accounting for the projection. 

{Set fi = 2e to start the algorithm} 

while jj, > e do 

(Select one index from P to construct the approximation of R in span{d„}jjgr} 
[£,c(£)]=AtReSelFFT(R,M,F, Case) 

(Generate the selected atom d^} 
de=TrgAt{i, M, N, Case). 

{Update residue} 

R ^ R - c{i)de 

(Since R is vector of real numbers} 
if Case = I theu 
f = M - £ + 2, 
d£/=TrgAt(U, M, A, Case), 

R^ R-c*(£)d^/ 
eud if 
/i = |c(£)| 

(Update coefficient} 
c{i) ^ c{i) + c{i) 
if Case = I theu 
c{M-i + 2) ^ c*(£) 

eud if 
eud while 

(Rename coefficients and residue to match the output variables} 

c = c, R = R 
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Algorithm 6 Main Algorithm for the proposed SPMP method dedicated to trigono¬ 
metric dictionaries and implemented via FFT. Procedure SPMPTrgFFT: [f^,c,F] = 

SPMPTrgFFT(f,M,p,e, Case) 

Input: Data f G M, number of elements in the dictionary, approximation error p > 0 
and tolerance e > 0 for the numerical realization of the projection Case (I , II, III, or IV). 
Output: Approximated data G Coefficients in the atomic decomposition, c. Indices 
labeling the selected atoms F = {£n}n=i- 
{Initialization} 

Set F = {0}, fo = 0, R° = f, k = 0, p = 2p 
(Begin the algorithm} 

while p > p do 

k = k + 1 

(Select index ik and calculate c{ik)} 

[4, c(4)]=AtSelFFT(R^-^ M, Case) 

(Generate the atom (4)} 
d4=TrgAt(4,M, A, Case) 

Updated F ^ F U 4 

(Calculate approximation and residue} 

ffc ^ ^fc-i _|_ c( 4 )d 4 , and R^ = f — 

(Subtract from R^ the component in span(d„}„gr} 

[R^, c]=ProjMPTrgFFT(R^, M, c, F, e. Case) 

(Update residue, approximation, coefficients, and error} 

Rfc ^ = f — H,*:; c = c, p = ||R^|| 

eud while 
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List of Figures 


1 (Color online only) Spectrograms of the Flute Exercise clip (left) N = 65536 samples 

at 22050 Hz, and that of the Classic Guitar, N = 262144 samples at 44100Hz. Each 
spectrogram was produced using a Hamming window of length 4096 samples and 50% 
overlap. 

2 (Color online only) SR, for the Flute Exercise clips (left) and Classical Guitar (right) 

corresponding to values of Nb equal to 512, 1024, 2048, 4096, 8192, and 16384 samples. 
The squares are the SR values obtained with the orthogonal basis The circles are 
the results produced by the mixed dictionary redundancy four, by means of the 
proposed algorithm. 

3 (Color online only) The left graph is the classic spectrogram of the Polyphon clip 

obtained with a Hamming window of length 4096 samples and 50% overlap. The right 
graph is the sparser version of the spectral decomposition, realized by the trigonometric 
dictionary and the SPMPTrgFET algorithm, on a partition of disjoint units of size 
Nb = 4096. 

4 (Color online only) The points joined by the darker line in all the graphs are the values 

of the inverse local sparsity ratio l/srg, q = 1,... ,Q. The top graphs correspond to the 
Flute (left) and Classic Guitar clips. The bottom graphs correspond to the Marimba 
(left) and Pop Piano clips. The lighter lines represent the signals. 

5 (Color online only) Average number of the iterations, 7t, for realizing the projection 

step procedure (Algorithm 5) corresponding to partition units of length Nb equal to 
512, 1024, 2048, 4096, 8192, and 16384 samples. The triangles are the values for the 
flute clip and the starts for the classic guitar. 







